Abstract: This paper presents FADE-10G - an integrated solution for modern multichannel
measurement systems. Its main aim is low-latency, reliable transmission of measurement data from
FPGA-based front-end electronic boards (FEBs) to a computer-based node in the Data Acquisition
System (DAQ), using a standard 1 Gbps or 10 Gbps Ethernet link. In addition to the transmission of
data, the system allows the user to reliably send simple control commands from the DAQ to the FEB
and to receive responses.

The aim of the work is to provide the simplest possible base solution, which can be adapted by
the end user to his or her particular needs. Therefore, the emphasis is put on minimal consumption
of FPGA resources in the FEB and minimal CPU load in the DAQ computer. The open source
implementation of the FPGA IP core and the Linux kernel driver, published under a permissive
license, facilitates modification and reuse of the solution.

The system has been successfully tested in real hardware, both with 1 Gbps and 10 Gbps links. 
1. Introduction

In modern multichannel measurement systems, it is often necessary to transfer multiple data streams
from detectors to the computers responsible for data processing. In particular, the introduction of
a triggerless approach in High Energy Physics (HEP) experiments […] has increased the demand on
the amount of data that must be transferred to the Data Acquisition (DAQ) System, and therefore
also on the number of links that must be provided. The Front End Boards (FEBs) are typically built
using FPGA chips, which nowadays are often equipped with gigabit or multi-gigabit transceivers.
That enables the implementation of a broad range of high-speed communication interfaces […].
When selecting the appropriate solution, we must take into account additional requirements such as
the length of the link (which in some experiments may reach even a few hundred meters […]) and
electrical insulation (so optical fiber is preferred). To reduce the total cost of implementing
multiple links, we should use a standard interface to benefit from the price reduction brought by
the mass production of transceivers and other components of the link infrastructure.

Considering the requirements mentioned above, an Ethernet link using SFP or SFP+ optical
transceivers appears to be the optimal solution. The broad use of Ethernet technology has
resulted in a significant reduction in the price of the components (namely SFP+ transceivers)
needed to implement a 10 Gbps Ethernet link both on the computer side and on the FPGA side.
The achievable price of an optical 10 Gbps SFP+ transceiver is approximately $85 for a single
channel. It can be further reduced when ordering a bigger batch, or when using four-channel QSFP+
transceivers (approximately $280 for four channels).

1.1 Required functionality of the transmission system 

The aim of this work is to create a minimal but extensible solution. Therefore, it is important to
define the requirements that should be fulfilled by such a system.

• Possibility to work with 1 Gbps (for price-sensitive applications) and 10 Gbps (for typical 
applications) Ethernet links 

• Reliable transport of data stream with maximal throughput and minimal latency (because 
the latency directly affects the amount of memory needed to buffer transmitted but not yet 
confirmed data). 

• Possibility to control FEBs and to check their status from the DAQ side of the link (even
though in a typical DAQ system there is yet another separate communication channel for
configuration and diagnostics of FEBs).

• Open source implementation that may be modified to suit the needs of the particular
experiment.

• Ability to work with different PHY interfaces (copper or optical), depending on the needs of 
the particular experiment. 

2. State of the art 

The standard solution for the reliable transfer of data via an Ethernet network is the TCP/IP
protocol. Unfortunately, this protocol has serious disadvantages when used in an FPGA. It has been
optimized mainly for the transport of data in wide area networks with multiple routers between
communicating devices. Therefore, it contains many features related to the routing of data
packets, the fragmenting of packets, and the sharing of link bandwidth between multiple
connections. TCP/IP also assumes that data may be transported via untrusted networks, and therefore
it implements sophisticated algorithms protecting the communication against malicious activity.
The final result is that an implementation of the full TCP/IP stack in an FPGA is complex and
resource hungry. Some implementations rely on a CPU implemented in or embedded in an FPGA […],
but such solutions do not allow full utilization of the 10 Gbps link throughput. There are some
commercial implementations of 10 Gbps hardware TCP/IP stacks for FPGAs, but they are closed
and expensive solutions […].
2.1 Reduced TCP/IP developed at CERN


An interesting attempt to reduce the resource consumption of a hardware-based TCP/IP
implementation is a solution developed at CERN and described in […]. The authors have reduced the
functionality of TCP/IP so that it is possible to implement it in an FPGA without implementing
any soft CPU core. The implementation provides only unidirectional transmission. The authors
did not implement timestamps, selective acknowledgments or out-of-band data. In addition, certain
mechanisms have been significantly simplified - e.g. the congestion management. This solution,
however, still relies on external DDR memory used as a TCP socket buffer. The advantage of
this solution is that the receiver may be a standard computer with the TCP/IP stack provided by
the operating system. However, this solution also leads to significant CPU load. As the authors
state themselves, “Running one 10 Gbps TCP stream can easily saturate one of the CPU cores.”

Another significant disadvantage is the closed source nature of this solution. No sources have
been released, so it cannot be a basis for an open, extensible solution.

2.2 Avoiding the TCP/IP complexity 

To avoid the implementation of a complex TCP/IP stack in the FPGA and to reduce the load of
the CPU in the receiving computer, it is desirable to use a simpler protocol. Using the UDP
protocol instead of TCP is not optimal. The UDP protocol does not assure reliable transfer of data,
so it is necessary to implement additional mechanisms ensuring reliability. At the same time, UDP
and all IP protocols still require significant overhead associated with the routing of packets
(datagrams). However, the connection between FEBs and DAQ should not contain any routers, as
they increase link latency, which in turn increases the memory needed to buffer the transmitted
and not yet confirmed data.

There are two possible link topologies. When the Ethernet interfaces in both the FEBs and
the DAQ computers have the same speed, point-to-point connections are used (see Figure 1a).
If the Ethernet interface in the DAQ computer offers a higher speed (e.g. 40 Gbps), it
is possible to connect a few FEBs to a single network card via a 10 Gbps/40 Gbps switch (see
Figure 1b). For such very simple networks, where Ethernet frames are passed either directly or via
a Layer 2 network switch, the best solution is to develop an optimized Layer 3 protocol using raw
Ethernet frames.

2.3 Ethernet Proxy - EPRO as a possible solution

A protocol and Linux kernel driver based on the above assumptions were developed at the AGH
University of Science and Technology and described in [13]. The proposed solution implements
not only reliable transport of the data stream, but also some additional functions. Those functions
include different types and priorities of data, and the possibility to send the same data to more
than one destination. The protocol is implemented for a 1 Gbps link and uses the standard Xilinx
MAC implementation. Unfortunately, this solution, like the previous one, is not open. The authors
did not publish sources, so it is not possible to modify it to work with higher speed 10 Gbps links
or to adjust it to the particular experiment’s requirements.
Figure 1. Possible topologies of Ethernet-based data transmission from FEB to DAQ. a) The case where
Ethernet interface speeds in FEBs and the computer are equal. b) The case where the computer offers
a faster Ethernet interface.


2.4 First version of the FADE protocol 


Another possible solution is the author’s open source FADE protocol described in [15]. This
protocol provides reliable transmission of data from an FPGA to a computer through 1 Gbps Ethernet
links. The resource consumption in the FPGA is kept to a minimal level and may be adjusted using
parametrized VHDL code. Instead of a complex standard MAC, simplified state machines are
used to receive and send packets. These are sufficient for full-duplex Ethernet links with
guaranteed link bandwidth. The initial version of the FADE protocol worked correctly with 1 Gbps
links, but an attempt to simply modify the FPGA IP core for operation with a 10 Gbps Ethernet PHY
revealed problems with efficiency. Therefore, the whole code was significantly modified.
Modifications included simplification of the packet management. For example, the concept of “sets
of packets” described in [15] was dropped in favor of a simple description of the data stream as
a continuous sequence of packets. Another modification was the addition of the possibility to
perform simple control and diagnostic operations via the Ethernet link, while the original FADE
protocol allowed only START and STOP commands to be sent.


This article describes the implementation of the new version of the FADE protocol, named
FADE-10G.
Table 1. Structure of the Ethernet frames used by the FADE-10G transmission protocol.

Part                      Content          Length
------------------------  ---------------  ------------------------------------------
Standard Ethernet header  Source MAC       6 bytes
Standard Ethernet header  Destination MAC  6 bytes
Standard Ethernet header  0xFADE           2 bytes
Protocol version          0x0100           2 bytes
Payload                   Payload bytes    depends on the type of frame
Filler                    N*0xa5           variable, used when the frame is too short
Checksum                  FCS              4 bytes


3. Implementation of the FADE-10G protocol

The FADE-10G protocol is aimed at the transmission of a continuous data stream consisting
of 64-bit words. To better utilize the link bandwidth, data are transmitted using Ethernet jumbo
frames. The data packets contain 1024 data words (8192 bytes) and some additional information
(the MTU should be set to 9000 in the network interface configuration). A number of data words per
packet equal to a power of two was chosen to simplify packet management both in the FPGA
and in the receiving computer, as described later. When the transmission is stopped, the last
packet may contain fewer data words. In such a case, the last data word contains the number of
valid words in that packet (between 0 and 1023). Because the protocol is supposed to be used as
the only protocol in private networks, the private, unofficial Ethertype 0xFADE is used. To
differentiate FADE-10G frames from the old FADE frames, and to allow further modifications of
the protocol, the protocol version number is transmitted after the Ethertype field. This number is
equal to 0x0100 in the current version*. Because the Ethernet link does not guarantee the reliable
delivery of frames, it is necessary to implement a simple acknowledgment/retransmission algorithm,
which uses special shorter acknowledgment frames. Other short frames are necessary to allow
the transmission of simple control or diagnostic commands via the Ethernet link. The general
structure of the FADE-10G Ethernet frame is shown in Table 1, and the payload contained in frames
of different types is shown in Table 2.
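The frame layout given in Tables 1 and 2 can be illustrated with a short Python sketch that builds
an acknowledgment frame. This is only an illustration, not the author's driver code: the function
name build_ack, the big-endian byte order of the multi-byte fields, and the padding to the 60-byte
minimum Ethernet frame length (counted without the checksum) are assumptions. The two MAC addresses
are passed in as an opaque 12-byte block.

```python
import struct

ETHERTYPE_FADE = 0xFADE   # private Ethertype used by the protocol
PROTO_VERSION = 0x0100    # protocol version of FADE-10G
ACK, NACK = 0x0003, 0x0004
MIN_FRAME_NO_FCS = 60     # assumption: pad to the minimum Ethernet frame size (FCS excluded)


def build_ack(mac_addrs: bytes, frame_seq: int, pkt_num: int,
              delay: int, nack: bool = False) -> bytes:
    """Build an (N)ACK frame without the FCS (hypothetical helper).

    Byte order is assumed to be big-endian; the source does not state it.
    """
    if len(mac_addrs) != 12:
        raise ValueError("expected two 6-byte MAC addresses")
    header = struct.pack(">HH", ETHERTYPE_FADE, PROTO_VERSION)
    # Payload of Table 2a: type, frame sequence number, packet number, delay
    payload = struct.pack(">HHII",
                          NACK if nack else ACK,
                          frame_seq & 0xFFFF,
                          pkt_num & 0xFFFFFFFF,
                          delay & 0xFFFFFFFF)
    frame = mac_addrs + header + payload
    # Filler bytes 0xa5 are appended when the frame is too short
    return frame + b"\xa5" * max(0, MIN_FRAME_NO_FCS - len(frame))


frame = build_ack(bytes(12), frame_seq=7, pkt_num=5, delay=0)
```

The 12-byte acknowledgment payload is much shorter than the minimum frame, so most of such a frame
consists of filler bytes.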

3.1 Implementation of the protocol in the FPGA 

The reliable transmission of data via an unreliable channel (like an Ethernet link) requires
retransmission. Therefore, it is necessary to buffer the data that have been transmitted, but have
not yet been confirmed by the receiving computer.

To keep the algorithm controlling the retransmission as simple as possible, the memory buffer
in the FPGA has a length of $M = 2^{N_{FPGA}}$ data packets. Each data packet is 1024 words
(8192 bytes) long. Thus, the lower $N_{FPGA}$ bits of the number of the data packet in the data
stream may be directly used to define its position in the memory buffer. The length of this buffer
also defines the transmission window of the protocol. At every moment, only a packet from a certain
set of M consecutive packets may be transmitted via the link. The $N_{FPGA}$ value may be
configured before synthesis of the core and compilation of the protocol driver (described in
Section 4).
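The addressing scheme above can be sketched in a few lines of Python. The value N_FPGA = 5 and
the helper names are assumptions chosen for illustration; the real core does this in VHDL by wiring
the relevant bits of the packet number to the memory address.

```python
N_FPGA = 5                  # assumed example value: buffer of 2**5 = 32 packets
M = 1 << N_FPGA             # buffer length and transmission window, in packets
WORD_BITS = 10              # 1024 words per packet = 2**10


def slot_of(pkt_num: int) -> int:
    """Buffer slot of a packet: the lower N_FPGA bits of its number."""
    return pkt_num & (M - 1)


def word_address(pkt_num: int, word_idx: int) -> int:
    """Word address in the buffer: slot bits followed by the word index bits."""
    return (slot_of(pkt_num) << WORD_BITS) | word_idx


def in_window(pkt_num: int, tail_pkt: int) -> bool:
    """True if pkt_num lies in the window of M packets starting at tail_pkt.

    The difference is taken modulo 2**32, so the check survives wrap-around
    of the 32-bit packet number.
    """
    return ((pkt_num - tail_pkt) & 0xFFFFFFFF) < M
```

Because both the buffer length and the packet length are powers of two, no multiplication or
division is needed to locate a word.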

*In the first version of the FADE protocol, this field contained the type of the frame and could be a value from 
the range 0x01 to 0x05 or the value 0xa5a5. 
Table 2. Structure of the payload in different Ethernet frames used by the FADE-10G transmission
protocol.

a) Data acknowledgment frame (from computer to FPGA)

Field                              Length
---------------------------------  -------
0x0003 (ACK) or 0x0004 (NACK)      2 bytes
Frame sequence number              2 bytes
Packet number in the data stream   4 bytes
Transmission delay                 4 bytes

b) User command request (from computer to FPGA)

Field                              Length
---------------------------------  -------
Command code                       2 bytes
Command sequence number            2 bytes
Command argument                   4 bytes

c) Standard data packet (from FPGA to computer)

Field                              Length
---------------------------------  ----------
0xA5A5                             2 bytes
Frame sequence number              2 bytes
Packet number in the data stream   4 bytes
Transmission delay                 4 bytes
Command response (see below)       12 bytes
Data                               8192 bytes

d) Last data packet (from FPGA to computer)

Field                              Length
---------------------------------  ------------------------------------
0xA5A6                             2 bytes
Frame sequence number              2 bytes
Packet number in the data stream   4 bytes
Transmission delay                 4 bytes
Command response (see below)       12 bytes
Data                               8184 bytes, not all must be valid
Number of valid words              8 bytes

e) Command response packet (from FPGA to computer)

Field                              Length
---------------------------------  --------
Filler                             2 bytes
Command response (see below)       12 bytes

Command response field in the data packet or command response packet:

Field                              Length
---------------------------------  -------
Command code                       2 bytes
Command sequence number            2 bytes
User defined return value          8 bytes

Each packet is associated with its descriptor, shown in Figure 2.

Figure 2. Data packets in the FPGA memory and their descriptors shortly after the start of transmission.
Bit flags: V-Valid, S-Sent, C-Confirmed, F-Flushed (used when the transmission is finished). The “Pkt” field
stores the 31-bit number of the packet in the data stream. The “Seq” field stores the 16-bit frame sequence
number used by the fast retransmission algorithm. The packets associated with descriptors 1 and 3 contain
valid data. They have been sent but are not confirmed yet. Please note that the sequence numbers are
higher than the packet numbers because three packets were retransmitted before. The packet associated with
descriptor 2 contains valid data; it has been sent and is confirmed. The packet associated with descriptor 4
contains valid data, but it has not yet been sent (therefore its sequence number is lower, as it is the sequence
number of the packet that previously occupied this slot). Other descriptors are free. Therefore, their flags
are cleared.

The structure of the IP core implemented in the FPGA is shown in Figure 3. The Ethernet
Receiver and Ethernet Sender blocks are simple state machines, replacing the standard Ethernet
MAC. They are connected to the external Ethernet PHY or to an internal Ethernet PHY equivalent
implemented in the FPGA, such as the Xilinx PCS/PMA core [16].

Figure 3. Structure of the FPGA IP core supporting the FADE-10G protocol.

The 64-bit data words provided by the data source are written to the data packet pointed to by
the Head pointer. When this packet is filled, it is marked as ready for transmission (V=1). Then
the Descriptor Manager checks if it is possible to move the Head pointer to the next position. If
the next position is the one pointed to by the Tail pointer, the buffer is full. In this case,
the ready status of the core is deasserted until the packet pointed to by the Tail pointer is
acknowledged and the Tail pointer is moved to the next position.

The Ethernet Receiver block receives packets, checks their checksum, and writes information
from correctly received packets to the Acknowledgment and Commands FIFO (Ack & Cmd
FIFO). Additionally, the Ethernet Receiver itself executes a few high priority commands like
START, STOP, and RESET. The START and STOP commands are still written to the Ack &
Cmd FIFO to ensure that their confirmation is generated. The RESET command causes the reset
of the whole FADE-10G core and is therefore not confirmed at all.

The Descriptor Manager reads commands from the Ack & Cmd FIFO. If the received command
is a packet acknowledgment (ACK) or a negative packet acknowledgment (NACK), the Descriptor
Manager handles it itself, as these commands are not confirmed. Other commands are
passed to the Command Processor, which executes the command and generates its confirmation.

The packet acknowledgment (ACK) command sets the C (Confirmed) flag in the descriptor
of the acknowledged packet if this packet is still kept in the buffer^1. If the received ACK packet
contains a packet number bigger than the number of the last transmitted packet^2, a protocol error
is detected.

After all commands available from the Ack & Cmd FIFO are executed, the Ethernet Receiver
block tries to move the Tail pointer, freeing all packets that have the C flag set in their
descriptor. All flags in the descriptors of freed buffers are cleared. After that operation, if
there is free space in the buffer, the ready status of the core is asserted again.

Another activity performed by the Descriptor Manager is the transmission and retransmission
of packets. It continuously browses the packet buffer and finds packets that have the V flag set
but the C flag unset. Those packets are passed to the Ethernet Sender block for transmission or
retransmission.

The last hardware block is the Command Processor, which may work in the same clock domain
as the Descriptor Manager but may also operate in another (even its own) clock domain.
The Command Processor executes the received command and, after the result or status is ready,
builds the command response and passes it to the Descriptor Manager. The command response is
then transmitted either in the nearest data packet or in a dedicated command response packet (if no
data packet is currently waiting for transmission or retransmission).

To avoid network congestion or computer overload, the core counts transmitted data packets and
retransmitted data packets. The ratio of those counts is then calculated. If the detected ratio
of retransmitted packets is too high, which may be a symptom of an overload, the delay between
transmitted packets is increased. If the ratio of retransmitted packets is very small, this delay
is decreased. Thresholds used by the delay adaptation algorithm are parametrized and may be changed
before synthesis of the core. For debugging purposes, the current transmission delay is included in
packets sent from the FPGA to the computer (field Transmission delay in Table 2).
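The delay adaptation can be sketched as follows. The text does not give the threshold values, the
step size, or the exact definition of the ratio, so everything numeric here is an assumption: the
ratio is taken as retransmitted packets divided by all transmitted packets, and the parameters
high, low, step and max_delay are hypothetical names for values that the core would fix at
synthesis time.

```python
def adapt_delay(delay: int, sent: int, retransmitted: int,
                high: float = 0.05, low: float = 0.005,
                step: int = 1, max_delay: int = 1000) -> int:
    """Return the new inter-packet delay (hypothetical parameters and units).

    delay         -- current delay between transmitted packets
    sent          -- packets transmitted in the observation period
    retransmitted -- how many of them were retransmissions
    """
    if sent == 0:
        return delay                        # nothing observed, keep the delay
    ratio = retransmitted / sent
    if ratio > high:                        # possible overload: slow down
        return min(delay + step, max_delay)
    if ratio < low:                         # link is healthy: speed up
        return max(delay - step, 0)
    return delay                            # between thresholds: no change
```

Keeping a dead band between the two thresholds prevents the delay from oscillating when the
retransmission ratio hovers around a single limit.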

3.2 Early retransmission mechanism 

The retransmission algorithm described has one significant disadvantage. If the packet pointed to by
the Tail pointer (or its acknowledgment) is lost, the space in the packet buffer will not be freed
until this packet is retransmitted and successfully confirmed. In the described implementation,
this packet will only be retransmitted after all other pending packets are transmitted or
retransmitted. Therefore, the core will not accept new data for a significant amount of time.

^1 It is possible that the core receives a delayed duplicated acknowledgment packet. In that case,
the buffer no longer contains the corresponding descriptor.

^2 The packet number wraps every $2^{32}$ packets. Therefore comparison of those numbers is defined
as follows: $N_1 > N_2$ if $(N_1 - N_2) \bmod 2^{32} < 2^{31}$.

Figure 4. Operation of the early retransmission mechanism: a) without sequence numbers, b) with sequence
numbers. In both cases, packet 2 is lost, and the ACK for packet 4 is lost. At time=7, packet 3 gets
confirmed, but no ACK for packet 2 has been received earlier. Therefore, packet 2 is scheduled for immediate
retransmission. At time=9, packet 5 gets confirmed, but neither packet 2 nor packet 4 has been confirmed
yet. Therefore in case (a) both those packets are scheduled for immediate retransmission. In case (b),
the sequence number is checked. At that moment, the last sequence number for packet 2 is equal to 7 and for
packet 4 to 4. As the received ACK for packet 5 has a sequence number equal to 5, only packet 4 is scheduled
for immediate retransmission.

The performance of the algorithm may be improved if such a packet is retransmitted as soon
as its loss (or the loss of its acknowledgment) is detected. The clear sign of such an event is
that the core receives an acknowledgment of a packet that was transmitted after that one. Such
a solution is similar to the “Fast retransmit” used in the TCP protocol [17].

In the simplest solution, after reception of the acknowledgment of any packet, all unconfirmed
packets with a packet number smaller than the one received will be retransmitted. Unfortunately,
this kind of simplistic implementation, based only on the number of the packet in the data stream,
is not optimal. If the loss of yet another packet is detected before the “early retransmitted”
packet is confirmed, this packet will be unnecessarily retransmitted once again (see Figure 4a).
To prevent this, the data packets are additionally labeled with the frame sequence number,
incremented after every transmission. The last frame sequence number used to transmit
the particular data packet is stored in the packet’s descriptor (the “Seq” field in Figure 2).
This frame sequence number is copied to the acknowledgment packet. When the loss of a packet is
detected, it is possible to retransmit early only those packets that have a last frame sequence
number smaller^3 than that of the acknowledgment packet just received (see Figure 4b).
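The selection rule can be illustrated with a small Python model that replays the situation from
Figure 4 at time=9. This is a sketch, not the VHDL implementation: the function names and the
dictionary of pending packets are hypothetical, and the comparison is made strict (equal numbers
do not count as "greater"), which is an assumption on top of the wrap-around rule given in
the footnotes.

```python
def num_gt(n1: int, n2: int, bits: int) -> bool:
    """True if n1 is 'greater' than n2 for numbers that wrap every 2**bits."""
    diff = (n1 - n2) % (1 << bits)
    return 0 < diff < (1 << (bits - 1))


def early_retransmit(pending: dict, ack_pkt: int, ack_seq: int) -> list:
    """Select packets for immediate retransmission after an ACK arrives.

    pending maps the number of each sent-but-unconfirmed packet to the last
    frame sequence number used to transmit it.
    """
    return sorted(
        pkt for pkt, last_seq in pending.items()
        if num_gt(ack_pkt, pkt, 32)          # packet is older than the ACKed one
        and num_gt(ack_seq, last_seq, 16)    # and was last sent before that frame
    )


# Figure 4b at time=9: packet 2 was last sent with sequence number 7,
# packet 4 with sequence number 4; the ACK is for packet 5, sequence 5.
selected = early_retransmit({2: 7, 4: 4}, ack_pkt=5, ack_seq=5)
```

Packet 2 is skipped because its latest transmission (sequence number 7) happened after the frame
that is being acknowledged, so its own acknowledgment may still be on its way.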

3.3 Execution of user commands 

To ensure that each command is delivered reliably, executed exactly once and its results are
delivered successfully to the computer, command sequence numbers (CSNs) are used.

Whenever a new command is sent to the FPGA core, the CSN is increased. That enables
discarding of possible duplicated responses to previous commands. After a new command is sent,
the computer waits for the response for a certain configurable amount of time. If the computer
does not receive the command response packet in the declared time period, it assumes that either
the command packet or the response packet was lost. In that case, the computer resends the same
command once again.

When the FPGA core correctly receives the command packet, it first checks its CSN. If it is
the same as in the last serviced command (which means that the response packet was lost, and
the command was resent), the core only resends the response for that last command. If the CSN
is different, the core stores it, executes the command and then sends the response packet with
the same CSN.

The command response is sent in the pending data frame if one is available or (when currently
no data packet is waiting for transmission or retransmission) in the dedicated command response
packet (see Table 2c-e).
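The exactly-once behaviour on the FPGA side can be modelled in Python as shown below. The class and
method names are hypothetical and the model ignores how the response is carried (data frame or
dedicated packet); it only shows why a resent command with an unchanged CSN is answered from
the stored response instead of being executed again.

```python
class CommandEndpoint:
    """Software model of the command handling in the core (hypothetical names)."""

    def __init__(self, execute):
        self._execute = execute        # callable(code, arg) -> return value
        self._last_csn = None          # CSN of the last serviced command
        self._last_response = None     # response sent for that command

    def handle(self, code: int, csn: int, arg: int):
        if csn == self._last_csn:
            # The response was lost and the command was resent:
            # repeat the stored response without executing again.
            return self._last_response
        result = self._execute(code, arg)
        self._last_csn = csn
        self._last_response = (code, csn, result)
        return self._last_response


executed = []


def run(code, arg):
    executed.append(code)              # record every real execution
    return arg + 1


endpoint = CommandEndpoint(run)
first = endpoint.handle(0x10, csn=1, arg=41)
repeat = endpoint.handle(0x10, csn=1, arg=41)   # resent after a lost response
second = endpoint.handle(0x10, csn=2, arg=1)
```

Only one previous response has to be remembered, because the computer does not issue a new command
before the previous one is answered.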

3.4 Resource consumption 

The FADE-10G core supporting 10 Gbps links was successfully synthesized for the Kintex 7
xc7k325tffg900-2 FPGA used in KC705 [18] and AFCK [19] boards. The version supporting
1 Gbps links was successfully synthesized for the Spartan 6 xc6slx45csg324-2 FPGA used in Atlys
boards [20]. Synthesis for the Kintex 7 FPGA was performed for two sizes of the memory buffer
($2^{N_{FPGA}} = 32$ and $2^{N_{FPGA}} = 16$). Due to the limited amount of internal memory,
synthesis for the Spartan 6 FPGA was performed only with $2^{N_{FPGA}} = 16$. Results of
the synthesis are presented in Table 3. The FADE-10G core leaves a reasonable amount of logic
resources available for the user to implement FEB blocks. For $2^{N_{FPGA}} = 32$,
the xc7k325tffg900-2 can easily accommodate FADE-10G cores operating four 10 Gbps links. For
$2^{N_{FPGA}} = 16$, the same chip can work with as many as eight such links.

4. Linux driver 

GNU/Linux is widely used in modern data acquisition systems. As a free and open source system,
it is a perfect platform for such an open solution as the one proposed in this paper.

Because FADE-10G uses a non-standard Ethernet protocol, it is necessary to implement
a dedicated kernel driver as a protocol handler responsible for the reception of the Ethernet
frames of type 0xFADE. Similarly to the solutions described in [13] and [15], the protocol handler
is installed using the dev_add_pack function. Whenever an Ethernet frame with the 0xFADE type is
received, the callback function in the driver is called.

^3 The frame sequence number wraps every $2^{16}$ packets. Therefore comparison of those numbers is
defined as follows: $N_1 > N_2$ if $(N_1 - N_2) \bmod 2^{16} < 2^{15}$.

Table 3. The per-link resource usage of the FADE-10G core synthesized with different sizes of
the memory buffer for different FPGA chips.

FPGA chip and buffer length   Resource         Used units  Available units  Usage
----------------------------  ---------------  ----------  ---------------  ------
Kintex 7 xc7k325tffg900-2,    Slices                  801           50,950   1.57%
$2^{N_{FPGA}} = 32$           Slice LUTs            2,107          203,800   1.03%
                              Slice registers       1,240          407,600   0.3%
                              BRAM tiles               65              445  14.6%

Kintex 7 xc7k325tffg900-2,    Slices                  756           50,950   1.48%
$2^{N_{FPGA}} = 16$           Slice LUTs            2,065          203,800   1.01%
                              Slice registers       1,234          407,600   0.3%
                              BRAM tiles               33              445   7.42%

Spartan 6 xc6slx45csg324-2,   Slices                  611            6,822   8.96%
$2^{N_{FPGA}} = 16$           Slice LUTs            1,599           27,288   5.86%
                              Slice registers       1,227           54,576   2.25%
                              BRAM blocks              68              116  58.6%

The driver may service one or more FPGA-based FEBs. They can be connected to separate
Ethernet cards, or (via a switch) to the same Ethernet card (see Figure 1). Each connected
FPGA-based FEB is serviced via a dedicated character device (/dev/l3_fpga%d, where %d is replaced
with subsequent numbers starting from 0). The maximum number of serviced FEBs is declared
when loading the driver using the max_slaves parameter. The character device may be opened and
configured for communication with the FPGA (slave) using the particular MAC address. After that,
a slave context is created, describing the state of communication with that slave. One of
the components of the slave context is the receiver packet buffer, which stores received data.
The amount of memory available in the computer is significantly higher than the amount of internal
memory in the FPGA. Therefore, this buffer may be much longer than the memory buffer in the FPGA
core (which has a length of $2^{N_{FPGA}}$ packets, as described in Section 3.1). Its length is
chosen to be $2^{N_{CPU}}$ packets. Thus, the lower $N_{CPU}$ bits of the packet number in the data
stream may be directly used as the number of the corresponding packet slot in the receiver packet
buffer. That is a circular buffer with the Head pointer and the Tail pointer pointing respectively
to the next byte to be written and to the last byte not yet read.
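The arithmetic behind the receiver packet buffer can be sketched as follows. The value N_CPU = 10
and the function names are assumptions made for illustration, and the sketch treats equal Head and
Tail pointers as an empty buffer; the real driver keeps these pointers in the kernel and exposes
them to the application through its ioctl interface.

```python
N_CPU = 10                              # assumed example: 2**10 = 1024 packet slots
PACKET_BYTES = 8192                     # 1024 words of 8 bytes
BUF_LEN = (1 << N_CPU) * PACKET_BYTES   # length of the circular buffer in bytes


def slot_offset(pkt_num: int) -> int:
    """Byte offset of the slot that holds the packet with the given number."""
    return (pkt_num & ((1 << N_CPU) - 1)) * PACKET_BYTES


def available(head: int, tail: int) -> int:
    """Bytes written by the driver but not yet read by the application."""
    return (head - tail) % BUF_LEN


def advance_tail(tail: int, processed: int) -> int:
    """New Tail position after the application has processed some bytes."""
    return (tail + processed) % BUF_LEN
```

The modulo operations handle the case where the Head pointer has already wrapped around the end of
the buffer while the Tail pointer has not.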


4.1 Packet reception routine 

The callback function my_proto_rcv is called when a packet of the 0xFADE type is received. It first
checks if the packet arrived from the correct (“opened”) FEB. If not, the driver sends a “reset”
command to the misbehaving FEB. If the correct FEB slave is found, further operations are performed
on the slave context of that FEB.

The function checks the protocol version. If it is incorrect, the appropriate error flag is set, and
the packet is dropped. Then the type of the received packet is checked. If it is a command response
packet, and if there is a thread waiting for completion of this command, the result is copied from
the packet to the user-space buffer. Then the waiting thread is woken up, and the function returns.
If the packet is neither a response packet nor a data packet, the error flag is set, and the packet is
dropped.

If none of the above conditions applies, the packet is handled as a data packet. First
the function checks the command response section. If there is a thread waiting for the completion
of the corresponding command, the result is copied to the user space. Then the waiting thread is
woken up (as in the case of a dedicated command response packet). Afterward, the data part of the
packet is handled. The receiver packet buffer stores the 32-bit packet number of the last received
and confirmed packet for each packet slot. The number of the received packet in the data stream
(see Table 2) is compared to the packet numbers of packets currently stored in the receiver packet
buffer. If the packet is already received and confirmed, or if the packet is “older”^4 than packets
in the buffer, it is assumed that the confirmation was possibly lost. In this case, the function
simply marks that the packet should be confirmed once again. If the packet is “newer” than
the packets in the current transmission window, it means that a protocol error has occurred -
the function sets the appropriate error flag and drops the packet. If none of the above conditions
applies, the packet contains new, unconfirmed data. The length of the packet is verified, and
the function checks if there is enough free space in the receiver packet buffer. If not, it drops
the packet (it will be retransmitted by the FEB). If there is enough free space, data from
the packet is copied to the corresponding packet slot.

If the received packet is the “last unconfirmed”, the routine updates the Head pointer. After
that, if the amount of data available in the buffer is higher than the “receiver wake-up threshold”
set by the user application, the receiving thread is woken up. If the received packet is the “last
data packet” (see Table 2d), the last packet flag is also set. If the last packet flag is set and
all packets are confirmed, the “end of transmission” flag is set, and the receiving thread is also
woken up to receive the last part of data. In each case, if required, the confirmation packet is
prepared and scheduled for transmission.

4.2 Communication with the user application 

To avoid conflicts when controlling different slaves, each character device (/dev/l3_fpga%d) may
be opened only once, by one application. However, the user application may perform two different
activities: reception of data and sending of control commands. Commands are serviced
synchronously: the thread sending the command is put to sleep until the command is executed
and the response is received. When the data is transmitted at high speed, it is unacceptable to stop
the data reception until the command is executed. Therefore, the user application should start an
additional thread after opening the device so that the reception and processing of data and the
execution of control commands are handled in separate threads.
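
The two-thread arrangement can be sketched as below. This is only a model of the threading structure under stated assumptions: a queue stands in for the receiver packet buffer, time.sleep stands in for the blocking command call, and run_demo, receiver and blocking_command are hypothetical names. No real device is opened.

```python
import queue
import threading
import time


def run_demo():
    data = queue.Queue()        # stands in for the receiver packet buffer
    received = []
    stop = threading.Event()

    def receiver():
        # Data path: keeps draining the buffer while a command is pending.
        while not stop.is_set() or not data.empty():
            try:
                received.append(data.get(timeout=0.01))
            except queue.Empty:
                pass

    def blocking_command():
        # Stand-in for a synchronous command: the caller sleeps until
        # the "response" arrives.
        time.sleep(0.05)
        return "ok"

    worker = threading.Thread(target=receiver)
    worker.start()
    for i in range(100):
        data.put(i)
    result = blocking_command()   # only the calling thread is blocked here
    stop.set()
    worker.join()
    return result, received
```

Because the command blocks only its own thread, the receiver continues to empty the buffer during the whole command round trip.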

To avoid overhead associated with copying data, the receiver packet buffer for each slave 
should be mapped into the appropriate application’s memory using the driver’s mmap function. 
Therefore, the data is copied only once from the socket buffer delivered by the Network Interface 

^The “age” of packets is checked by subtracting their numbers modulo 2^32. A result below 2^31 is considered to be
a positive number.
Table 4. The ioctl commands implemented in the kernel module to support communication with the
user-space application.

IOCTL code             Description of the command
---------------------  ----------------------------------------------------------------
L3_V1_IOC_SETWAKEUP    Sets the number of data bytes that must be available in the
                       circular buffer before the user-space application is woken up.
L3_V1_IOC_GETBUFLEN    Returns the length of the circular buffer associated with
                       a particular FEB.
L3_V1_IOC_READPTRS     Returns the number of available data bytes and the positions of
                       the Head pointer and Tail pointer in the circular buffer
                       associated with a particular FEB. Provides necessary
                       synchronization when accessing the pointers.
L3_V1_IOC_WRITEPTRS    Should be called with the number of bytes processed by
                       the application. Provides necessary synchronization and updates
                       the Tail pointer in the circular buffer associated with
                       a particular FEB.
L3_V1_IOC_GETMAC       Associates the FEB identified by the given MAC and connected to
                       the given network interface with the particular character device.
L3_V1_IOC_STARTMAC     Starts the transmission from the previously associated FEB.
L3_V1_IOC_STOPMAC      Stops the transmission from the FEB associated with
                       the particular character device.
L3_V1_IOC_FREEMAC      Disassociates the FEB from the particular character device.
L3_V1_IOC_RESETMAC     Resets the FADE-10G core in the FEB associated with a particular
                       character device.
L3_V1_IOC_USERCMD      Sends the user command to the FEB associated with a particular
                       character device. This command puts the current thread to sleep
                       until the command is executed and the result is sent back.

Card (NIC) driver to the shared kernel receiver packet buffer^. Of course, access to such shared
memory must be appropriately synchronized. That is achieved using the ioctl function. The driver
implements a set of ioctl commands summarized in Table 4.


4.2.1 Reception of data 

The user application may read the current positions of the Head pointer and Tail pointer in its
receiver packet buffer using the L3_V1_IOC_READPTRS ioctl command. This command ensures
appropriate synchronization so that stable Head pointer values are read. Additionally, this
command returns the number of available bytes in the receiver packet buffer. To avoid active
waiting for data, the application may define (with the L3_V1_IOC_SETWAKEUP ioctl command)
how many bytes of data must be available in the receiver packet buffer before the receiver thread is
woken up. The thread will also be woken up when the transmission finishes, even if the number of
available bytes is below the defined threshold.
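
The interplay of the Head and Tail pointers can be illustrated with a user-space model of the circular buffer. This sketch does not call the driver: RingBufferModel, read_ptrs, write_ptrs and consume are hypothetical names that only mimic the roles of the READPTRS and WRITEPTRS commands, and the order of the returned values is an assumption.

```python
class RingBufferModel:
    """Toy model of the shared circular buffer (not the kernel driver)."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.size = size
        self.head = 0   # next position written by the "driver"
        self.tail = 0   # next position read by the application

    def available(self):
        return (self.head - self.tail) % self.size

    def driver_write(self, payload):
        # One byte is kept free so that head == tail always means "empty".
        if len(payload) > self.size - 1 - self.available():
            raise BufferError("no free space; the packet would be dropped")
        for b in payload:
            self.buf[self.head] = b
            self.head = (self.head + 1) % self.size

    def read_ptrs(self):
        # Plays the role of the READPTRS command.
        return self.available(), self.head, self.tail

    def write_ptrs(self, processed):
        # Plays the role of the WRITEPTRS command: release processed bytes.
        self.tail = (self.tail + processed) % self.size


def consume(ring):
    """Read all available bytes, handling the wrap-around, then release them."""
    avail, _head, tail = ring.read_ptrs()
    first = min(avail, ring.size - tail)          # bytes up to the buffer end
    data = bytes(ring.buf[tail:tail + first]) + bytes(ring.buf[:avail - first])
    ring.write_ptrs(avail)
    return data
```

In the real application the bytes would be processed in place in the memory-mapped buffer; the copy made by consume is only for demonstration.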

^There are technologies offering the true zero copy handling of network data like “Direct NIC Access” pi] ] or
“PF_RING ZC” plj]. However, it is not clear whether they can be used to create a continuous representation of the
received data in the user application memory without additional copying. The PF_RING ZC still requires single copying
of the data when used with a standard NIC.
4.2.2 Sending the user commands


When sending a user command, the user fills a structure containing the code of the command, its
argument, the number of retries and the timeout for each retry. Those parameters allow the user
to adjust the behavior of the driver to the expected execution time of the command. The pointer to
this structure is passed as the argument of the ioctl call. When the ioctl L3_V1_IOC_USERCMD
is executed, the current thread is put to sleep until the command response is received or until
the timeout expires. In the latter case, the command is resent until the given number of retries is
reached. Together with the functionalities of the FPGA core described in subsection 3.3, this
implementation ensures correct single execution of the command, even if either the command packet
or the response packet gets lost.
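
How retries can coexist with single execution is sketched below. The mechanism shown (a command sequence number, with the FEB replaying a stored response for a repeated number) is an assumption made for illustration; the actual behavior of the FPGA core is the one described in subsection 3.3. FebModel, send_command, max_attempts and lose_response_times are hypothetical names.

```python
class FebModel:
    """Toy FEB: executes each command number once, replays the stored response."""

    def __init__(self):
        self.executions = 0
        self.last_seq = None
        self.last_resp = None

    def handle(self, seq, code, arg):
        if seq != self.last_seq:             # new command: execute it
            self.executions += 1
            self.last_seq = seq
            self.last_resp = (seq, code + arg)   # arbitrary "result"
        return self.last_resp                # repeated command: resend response


def send_command(feb, seq, code, arg, max_attempts, lose_response_times):
    """Resend the command until a response arrives or attempts run out."""
    for attempt in range(max_attempts):
        resp = feb.handle(seq, code, arg)    # command packet reaches the FEB
        if attempt < lose_response_times:
            continue                         # response lost: timeout, resend
        return resp
    return None                              # all attempts timed out
```

In this model a command whose response is lost twice is still executed exactly once, because the second and third transmissions carry the same sequence number.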


5. Tests and results 

The FADE-10G protocol was tested in different scenarios. The operation at 1 Gbps was verified
using an Atlys board [ p0| ] and Dell Vostro 3750 (Intel Core i7-2630QM CPU with 2.0 GHz clock). 
10 Gbps operation was verified using the KC705 board [ jl^ and a computer equipped with an Intel 
Core i5-4440 CPU with 3.10 GHz clock. Operation with four 10 Gbps links was verified with
an AFCK board [|I^ equipped with an FMC board with 4 SFP+ cages. The board was connected
to a computer equipped with an Intel Xeon CPU E5-2630 v2 with 2.60 GHz clock. 

Correctness of transmission was tested with the FPGA core sending a preprogrammed
sequence of data that was later verified by the receiving computer. Transmissions up to 10 Tb
were tested, and no transmission errors occurred. The achievable transmission speed was equal
to 990.34 Mbps with the 1 Gbps interface in the Atlys board.

In tests of maximum transmission speed with 10 Gbps links, it was found that verification of
data led to a decrease of the achievable throughput. The user application was not able to process
data at the full speed, and the congestion avoidance algorithm was activated. Therefore, the
maximum throughput tests were performed without a full data verification. Tests with the 10 Gbps
interface in the KC705 board demonstrated a throughput of 9.815 Gbps. However, to achieve such
a throughput it was necessary to decrease the receive interrupt latency in the network adapter with
the “-C rx-usecs 0” option of the ethtool program. With a standard interrupt latency of 1 µs,
the achievable throughput was equal to only 6.5 Gbps.

The CPU load was measured with the top program. The load during the transmission via 
a single link (without data verification) was measured on the computer with an Intel Core i5 CPU. 
The result was equal to 2.15% (0.97% in user processes, 0.31% in system processes and 0.87% in 
software interrupts). 

Operation with four 10 Gbps links in the AFCK board working simultaneously has shown
limitations related to the computer speed. The achieved mean throughput was equal to 9.72 Gbps 
per link. With full data verification, the throughput was further limited to 8.88 Gbps per link. 
The CPU load measured during the transmission through four links without verification was equal 
to 3.90% (0.10% in system processes and 3.80% in software interrupts; load in user processes was 
reported as 0%). The measurements were performed on the computer with an Intel Xeon CPU. 

In the final measurement system, the data delivered to the memory mapped buffer should be
split into records to be routed to computers in the DAQ network. With the appropriate organization
of the readout format, the separation of records should not require checking of each received word.
Therefore, it is expected that the CPU load related to the routing of data should be significantly
lower than the load generated by the full data verification. Furthermore, transmission of records to
the DAQ network via modern NICs should be done using the DMA with minimal CPU load.
However, implementation of the readout protocol and the data routing application is beyond the scope
of this paper, as it is a responsibility of the end user.

In summary, the presented test results suggest that the FADE-10G system should allow
reception and routing of one 10 Gbps data stream with a computer with an Intel Core i5 CPU, and 4
such streams with a computer with an Intel Xeon CPU.

Of course, the CPU computational power is not the only bandwidth limiting factor. When
designing the system, it is necessary to consider the throughput of all I/O interfaces involved in the
transfer of data. To check the behavior of the FADE-10G system in conditions where the output channel
limits the overall bandwidth, another set of tests was performed with the user application writing
the data directly to a SATA SSD disk. In this setup, the throughput was limited by the disk. The
throughput achieved was the same as for an application storing pre-generated pseudorandom
data to the disk. The congestion avoidance algorithm correctly limited the transmission rate
from the FPGA core. The data stored on the disk was later analyzed, and no corrupted data was
discovered.

The last series of tests verified transmission of the user commands in the worst conditions.
The dedicated application transmitted user commands during reception of the continuous stream
of data. In the 1 Gbps setup, the FPGA core was able to execute 2870 commands per second with
data transmission speed unaffected. In the 10 Gbps setup, the FPGA core was able to execute ca.
40000 commands per second without impairing the transmission speed.

The protocol has been optimized for operation at a high data rate. For lower data rates, there is
a danger that data may wait too long until the data packet is completed and transmitted. To avoid
this, the data source should provide a minimum flow of data (e.g. “time stamps” or “dummy data”)
in low data rate conditions. That ensures that the delay associated with the completion of
the data packet is acceptable. This approach keeps the protocol as simple as possible but requires
the user to enforce the minimal required data flow in the upper layer. As that additional data is
inserted only in low traffic conditions, its removal in the receiver application should not
significantly increase the CPU load. An alternative solution could be sending incomplete packets after
a user-defined timeout. Such a packet could have a payload similar to that of the “last packet” shown in
Table ^, but should be marked as another packet type (e.g. 0xa5a7). Unfortunately, handling such
packets at the receiver side would break the idea of an efficient, transparent passing of the received
data stream into the memory buffer directly available to the user application and would result in
a drop of efficiency at a high data rate.
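
The delay caused by waiting for a full packet follows from simple arithmetic, sketched below. The payload size used in the example (8192 bytes) is a hypothetical value chosen for illustration, not a figure taken from this section; fill_delay_s and min_rate_bps are hypothetical helper names.

```python
def fill_delay_s(payload_bytes, source_rate_bps):
    """Time needed for the data source to fill one packet payload, in seconds."""
    return payload_bytes * 8 / source_rate_bps


def min_rate_bps(payload_bytes, max_delay_s):
    """Minimum data flow that keeps the packet completion delay below max_delay_s."""
    return payload_bytes * 8 / max_delay_s
```

With the assumed 8192-byte payload, a source producing 1 Mbps needs about 65.5 ms to complete a packet, and keeping the completion delay below 1 ms requires a flow of about 65.5 Mbps, which the source would have to maintain with time stamps or dummy data.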

6. Conclusions 

The presented FADE-10G system allows reliable transmission of measurement data from an FPGA
via a 1 Gbps or 10 Gbps interface to a computer running Linux OS. The system can almost fully
utilize the link throughput. Apart from transmission of data, the system implements simple control
or diagnostic commands, which are reliably transmitted to the FPGA. The results of the command
execution are also reliably transferred to the computer. Even with a fully occupied link, the system
executes over 2800 commands per second with a 1 Gbps link, and ca. 40000 commands per second
with a 10 Gbps link. The system minimizes packet acknowledgment latency, which in turn allows
the reduction of the amount of memory needed in the FPGA to buffer the data. Additionally,
the system implements a special “early retransmission” mechanism, which reduces the latency of
the data retransmission in case of a lost packet. The data received by the computer is delivered to
the user application using the memory mapped kernel buffer, which avoids unnecessary data copying
and reduces the CPU load.

The FADE-10G system is implemented in the simplest possible way and published under
permissive licenses (most parts under the BSD license, some under the GPL license and some as public domain).
Therefore, it can be a good base solution for further development of a transmission system suited
to a particular experiment. Sources of the FADE-10G project are available on the OpenCores
website [Q.